Data Bootcamp Final Project
Yuxin Xia (yx784@nyu.edu)
Dec.15th 2016
Overview:
There are many reasons for people going to colleges, such as obtaining knowledge, meeting new people and expanding social networks. Within these benefits, getting a diploma is an attractive one. In most cases, people believe that diplomas will benefit their career in the future and help them earn more.
In this project, I will try to quantity the economic benefits of getting a degree(including bachelor, master and doctor degrees) with two main data availabe online: percentage of workers with different degrees and the weekly salary of each occupation. Compared with the tuition people paid to get a degree, this project can help people see if their college education is worthwhile from an economic perspective of view.
Related Website:
In [1]:
import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
from plotly.offline import iplot, iplot_mpl # plotting functions
import plotly.graph_objs as go # ditto
import plotly
plotly.offline.init_notebook_mode() # run at the start of every ipython notebook
First, I am going to show a general picture of the annual wage for people with different education levels. The table I got contains the median annual wage for people of different degrees.
Data Sourse:
title of the file: Employment, wages, and projected change in employment by typical entry-level education (Employment in thousands)
go to page and click on “Other available formats: (XLSX)” on the top to get the file downloaded to the laptop.
I uploaded the file to Github.
In [2]:
#read the file from github and only take the sheet with name "summary"
data_general = pd.read_excel("https://github.com/yuxinXia96/final-project/raw/master/education.xlsx",\
sheetname = 'Summary')
In [3]:
#drop rows with missing data
data_general= data_general.dropna()
#keep columns: oppupation and annual wage, drop cloumns: number of worker, percentage distribution, employment change
data_general = data_general.drop(['Unnamed: 1','Unnamed: 2','Unnamed: 3'],axis = 1)
#rename the columns
data_general = data_general.rename(columns = {'Employment, wages, and projected change in employment by typical \
entry-level education (Employment in thousands)':'Education','Unnamed: 4':'Median annual wage(2015)'})
In [4]:
#get the medium annual wage within all the workers, add it as a column
data_general['Mean'] = data_general['Median annual wage(2015)'][2]
#drop the row with information about the total population
data_general = data_general.drop([2])
In [5]:
#the table looks like:
data_general
Out[5]:
In [6]:
#plot
general_benefit = dict(type="scatter",
name="median annual wage for different degrees",
x=data_general['Education'],
y=data_general["Median annual wage(2015)"],
marker={"color": "blue"},
mode = 'markers-lines',
)
general_mean = dict(type="scatter",
name="Average annual wage",
x=data_general['Education'],
y=data_general["Mean"],
marker={"color": "purple"},
mode = 'lines',
)
layout = dict(
yaxis={"title": "annual wage"},
title="Annual Wage for People of Different Degrees",
)
iplot(go.Figure(data=[general_benefit,general_mean],layout=layout))
Although there are some exceptions (exceptions may raise from decresed working experience), generally sepaking, higher education leads to higher annual wage. However, the benefits of getting a degree differ from occupation to occupation. I am going to show the economic benefits of a degree to some detailed occupations.
It’s clear that the higher education a person has, the easier for him/her to find a particular job.
I found a file with the percentage of educational attainment for workers ordered by detailed occupations. To quantify the easiness of finding a job, for each education level, I used the cumulative percentage of workers with that occupation.
For each occupation, I multiplied the percentage of the workers with the weekly salary of that occupation. Then I multiplied the result with 52(52 weeks in year)to get the annual income.
The changes of annual income with different education levels can show the economic benefits brought by education.
Data Sources:
In [7]:
#read the weekly earning file from the Internet. Notice that the type of the file is list.
wage = pd.read_html("http://www.bls.gov/cps/cpsaat39.htm")
#data_wage[0] and data_wage[1] have similar data, so only keep one
wage = wage[1]
#convert list into dataframe
data_wage = pd.DataFrame(wage,columns = ['Occupation','2015','Total'])
#see the size of the file
data_wage.shape
Out[7]:
In [8]:
#some adjustment to get the data I need -- occupation,number of workers, weekly salary
#total wage: remove $ sign
data_wage = data_wage.replace(to_replace='$809',value=809)
#replace '-' with missing data NaN
data_wage = data_wage.replace('-',pd.np.nan)
#change the data type of 'Total'
data_wage['Total'] = data_wage['Total'].astype(float)
#drop the rows with missing data
data_wage = data_wage.dropna()
#rename the column labels
data_wage = data_wage.rename(columns = {'2015':'number of workers','Total':'weekly salary'})
#see the first 5 rows of the table
data_wage.head(5)
Out[8]:
In [9]:
#read education data from the file I have uploaded to the Internet, only read the sheet I need
data_education = pd.read_excel("https://github.com/yuxinXia96/final-project/raw/master/occupation.xlsx",sheetname=11)
#see the size of the table
data_education.shape
Out[9]:
In [10]:
#some adjustments to get the cumulative percentage of each degree for every occupation
#basically add the pertage of workers with education equal or lower than a certain degree
data_education['%high school']=data_education['Unnamed: 2']+data_education['Unnamed: 3']
data_education['%bachelor']=data_education['Unnamed: 2']+data_education['Unnamed: 3']+data_education['Unnamed: 4']\
+data_education['Unnamed: 5']+data_education['Unnamed: 6']
data_education['%master']=data_education['Unnamed: 2']+data_education['Unnamed: 3']+data_education['Unnamed: 4']\
+data_education['Unnamed: 5']+data_education['Unnamed: 6']+data_education['Unnamed: 7']
data_education['%doctoral']=data_education['Unnamed: 2']+data_education['Unnamed: 3']+data_education['Unnamed: 4']\
+data_education['Unnamed: 5']+data_education['Unnamed: 6']+data_education['Unnamed: 7']+data_education['Unnamed: 8']
data_education = data_education.drop(['Unnamed: 1','Unnamed: 2','Unnamed: 3','Unnamed: 4','Unnamed: 5'],axis=1)
data_education = data_education.drop(['Unnamed: 6','Unnamed: 7','Unnamed: 8'],axis=1)
#treat 0 percentage of workers in a certain occupation as missing value
data_education = data_education.replace(0,pd.np.nan)
#get rid of the rows with missing value
data_education = data_education.dropna()
#drop the first row which is not an observation but some descriptions
data_education = data_education.drop([0])
#rename the column lables
data_education = data_education.rename(columns=
{'Table 1.11 Educational attainment for workers 25 years and older by detailed occupation, 2014-15':'Occupation'})
#get rid of all the (1) in the occupation list
data_education = data_education.replace('\(1\)','', regex=True)
#see the top 5 rows of the table
data_education.head()
Out[10]:
In [11]:
#merge two tables(weekly wage and education)
benefits = pd.merge(data_wage, data_education, # left and right df's
how='left', # add to left
on='Occupation' # link with this variable
)
In [12]:
#calculate the change in weekly salary
benefits['%added bachelor']=benefits['%bachelor']-benefits['%high school']
benefits['%added master']=benefits['%master']-benefits['%bachelor']
benefits['%added doctoral']=benefits['%doctoral']-benefits['%master']
#calulate the change in annual salary
benefits['B benefit']=benefits['%added bachelor']*benefits['weekly salary']*52
benefits['M benefit']=benefits['%added master']*benefits['weekly salary']*52
benefits['D benefit']=benefits['%added doctoral']*benefits['weekly salary']*52
#drop rows with missing value
benefits = benefits.dropna()
In [13]:
#plot
salary1 = dict(type="scatter",
name="economic benefit of a bachelor degeree",
x=benefits['Occupation'],
y=benefits["B benefit"],
marker={"color": "yellow"},
mode = 'markers',
# size = benefits['number of workers']
#remove ticks on x axis
)
salary2 = dict(type="scatter",
name="economic benefit of a master degree",
x=benefits['Occupation'],
y=benefits["M benefit"],
marker={"color": "orange"},
mode = 'markers',
# size = benefits['number of workers']
#remove ticks on x axis
)
salary3 = dict(type="scatter",
name="economic benefit of a doctorial degree",
x=benefits['Occupation'],
y=benefits["D benefit"],
marker={"color": "red"},
mode = 'markers',
# size = benefits['number of workers']
#remove ticks on x axis
)
#marker = dict(size = benefits['number of workers'])
layout = dict(
yaxis={"title": "economic benefits(in $)"},
title="Economic Benefits for Different Oppupations",
)
iplot(go.Figure(data=[salary1,salary2,salary3],layout=layout))
From the plot, it shows that Bachelar degree is most helpful in increasing salaries. Master and doctor degrees are more helpful to people with jobs in finance, research and education field(left half of the graph) than blue collars(right half of the graph). One interest point is the highest red dot. A doctor degree has huge impact on lawyers.
In [14]:
#get a new table with occupations related to business, finance, operation and accounting
b = benefits[benefits.Occupation.str.contains('Financial')|benefits.Occupation.str.contains('operation')|\
benefits.Occupation.str.contains('business')|benefits.Occupation.str.contains('account')|\
benefits.Occupation.str.contains('financial')]
#see the data
b
Out[14]:
In [15]:
#plot
s1 = dict(type="bar",
name="economic benefit of bachelor degeree",
x=b['Occupation'],
y=b["B benefit"],
marker={"color": "yellow"}
)
s2 = dict(type="bar",
name="economic benefit of master degree",
x=b['Occupation'],
y=b["M benefit"],
marker={"color": "orange"},
)
s3 = dict(type="bar",
name="economic benefit of doctorial degree",
x=b['Occupation'],
y=b["D benefit"],
marker={"color": "red"},
)
#marker = dict(size = benefits['number of workers'])
layout = dict(
yaxis={"title": "economic benefits (in $)"},
title="Economic Benefits for Oppupations in Business and Finance Field",
)
iplot(go.Figure(data=[s1,s2,s3],layout=layout))
The picture shows that bachelor degrees are of great help to business people. Master degrees are helpful to people in finance field, but not that helpful to people who have the work related to accounting. Doctor degrees are helpful to people who work as analysts and specialists, but add less value compared with bachelor or master degrees.
After examing the economic benefits of getting a degree, we now focus on the cost paid to get a degree. The cost we consider here is tuition. Unlike the payment for future jobs, people usually know exactly how much they need to pay for a certain school they are accepted. As a result, I am only going to show a general picture of college tuitions without seperating colleges according to their locations or major focuses. The annual tuition here is just a rough estimation of the cost.
Data Source:
The website leads to an online database. The important key words I chose are: use final release data; select institutions: by group—U.S. only; select variables: tuition and fee (2015-16). After selecting these key words, the database will provide a csv file with the data needed. I uploaded the file to Github.
In [16]:
data_tuition = pd.read_csv('https://raw.githubusercontent.com/yuxinXia96/final-project/master/CSV_1262016-477.csv')
In [17]:
#drop the columns with year info and ID info
data_tuition = data_tuition.drop(['unitid','year'],axis=1)
#drop rows with missing data
data_tuition = data_tuition.dropna()
#rename columns
data_tuition = data_tuition.rename(columns = {'institution name':'institution',
'DRVIC2015.Tuition and fees, 2015-16':'tuition'})
#calculate average tuition
data_tuition['mean'] = data_tuition['tuition'].mean()
In [18]:
#plot
t = dict(type="scatter",
name="tuition 2015-16",
x=data_tuition['institution'],
y=data_tuition["tuition"],
marker={"color": "pink"},
mode = 'markers',
)
m = dict(type="scatter",
name="mean",
x=data_tuition['institution'],
y=data_tuition["mean"],
marker={"color": "brown"},
mode = 'lines',
)
layout = dict(
yaxis={"title": "tuition(in $)"},
title="Tuition 2015-16",
)
iplot(go.Figure(data=[t,m],layout=layout))
From the picture, it shows that the average tuition is around 15K per year, and tuition is almost within the range from 5K to 50K.
Recall the pictures we did previously, the benefits of barchalor degrees are mostly above 1.5M. Thus, going to a college is worthwhile even without considering the gain of knowledge and improvement of social networks. Also, the annual salary increase doesn’t consider the lifelong benefits from education. The increases in earnings will last much longer than the school years. So the actual economic benefits will be bigger than what we can see from the graph.
In [ ]: